# Zero-Shot Learning
## Nal V40 Sdxl
Nalgotic Dreams is a text-to-image model based on Stable Diffusion XL, specializing in high-quality anime-style images, particularly bright, detailed illustrations of girl characters.
License: Other · Tags: Image Generation, English · Author: John6666 · Downloads: 203 · Likes: 1
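SDXL checkpoints like this one are normally loaded through Diffusers' `StableDiffusionXLPipeline`. A minimal sketch, assuming the repository id `John6666/nal-v40-sdxl` (hypothetical; check the actual model page) and a CUDA GPU:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Repository id is an assumption; substitute the actual repo from the model page.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "John6666/nal-v40-sdxl",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Anime-style checkpoints typically respond well to tag-style prompts.
image = pipe(
    "1girl, bright colors, detailed illustration, masterpiece",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("nal_v40_sample.png")
```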
## Openvision Vit Small Patch8 384
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
License: Apache-2.0 · Tags: Multimodal Fusion · Author: UCSC-VLAA · Downloads: 21 · Likes: 0
## Openvision Vit Tiny Patch8 224
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
License: Apache-2.0 · Tags: Multimodal Fusion · Author: UCSC-VLAA · Downloads: 123 · Likes: 0
## Openvision Vit Tiny Patch16 384
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
License: Apache-2.0 · Tags: Multimodal Fusion · Author: UCSC-VLAA · Downloads: 19 · Likes: 0
## THUDM.GLM 4 32B 0414 GGUF
GLM-4-32B-0414 is a 32-billion-parameter large language model developed by THUDM, suited to a wide range of text generation tasks; this repository packages it in GGUF format.
Tags: Large Language Model · Author: DevQuasar · Downloads: 13.15k · Likes: 5
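GGUF builds like this one target llama.cpp-compatible runtimes. A minimal sketch using the llama-cpp-python bindings; the quantization filename below is an assumption, so substitute whichever .gguf file you actually downloaded from the repository:

```python
from llama_cpp import Llama

# Filename is hypothetical; use the actual quant you downloaded (e.g. Q4_K_M).
llm = Llama(
    model_path="GLM-4-32B-0414.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available
)

out = llm("Explain what a GGUF file is in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])
```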
## PURE
PURE is the first framework to employ a Multimodal Large Language Model (MLLM) as the backbone for solving low-level vision tasks.
Tags: Image Enhancement, Safetensors · Author: nonwhy · Downloads: 326 · Likes: 1
## Qwen.qwen2.5 VL 72B Instruct GGUF
Qwen2.5-VL-72B-Instruct is a large-scale vision-language model developed by the Qwen (Tongyi Qianwen) team, supporting multimodal understanding and generation over images and text; this repository packages it in GGUF format.
Tags: Image-to-Text · Author: DevQuasar · Downloads: 281 · Likes: 0
## Vit So400m Patch16 Siglip 512.v2 Webli
A SigLIP 2-based Vision Transformer for image feature extraction, suitable for multilingual vision-language tasks.
License: Apache-2.0 · Tags: Image Feature Extraction, Transformers · Author: timm · Downloads: 2,766 · Likes: 0
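timm checkpoints follow a predictable naming scheme, so this encoder can most likely be loaded as `vit_so400m_patch16_siglip_512.v2_webli` (the listing title lowercased; verify on the model page). A sketch of feature extraction with the classifier head removed:

```python
import timm
import torch
from PIL import Image

# Model id inferred from the listing title; confirm against the hub page.
model = timm.create_model(
    "vit_so400m_patch16_siglip_512.v2_webli",
    pretrained=True,
    num_classes=0,  # drop the head, return pooled features
)
model.eval()

# Build the preprocessing that matches this checkpoint's training config.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

img = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    features = model(transform(img).unsqueeze(0))  # shape: (1, embed_dim)
print(features.shape)
```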
## Vit So400m Patch14 Siglip Gap 378.v2 Webli
A SigLIP 2-based Vision Transformer pre-trained on the WebLI dataset, with the attention pooling head removed and global average pooling applied.
License: Apache-2.0 · Tags: Image Classification, Transformers · Author: timm · Downloads: 20 · Likes: 0
## Vit Base Patch16 Siglip 256.v2 Webli
A SigLIP 2-based ViT image encoder for extracting image features, supporting multilingual vision-language tasks.
License: Apache-2.0 · Tags: Image Feature Extraction, Transformers · Author: timm · Downloads: 731 · Likes: 2
## Phi 4 Model Stock V2
Phi-4-Model-Stock-v2 is a large language model merged from multiple Phi-4 variants using the model_stock merging method, showing strong performance across multiple benchmarks.
Tags: Large Language Model, Transformers · Author: bunnycore · Downloads: 56 · Likes: 2
## Videolisa 3.8B
A video language-guided reasoning segmentation model built on LLaVA-Phi-3-mini-4k-instruct, focused on object segmentation in videos.
License: Apache-2.0 · Tags: Video Segmentation, Safetensors, English · Author: ZechenBai · Downloads: 247 · Likes: 6
## Omnigen V1
OmniGen is a unified image generation model that supports a range of image generation tasks.
License: MIT · Tags: Image Generation · Author: Shitao · Downloads: 5,886 · Likes: 309
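OmniGen ships with its own pipeline class rather than a stock Diffusers one. The sketch below assumes the `OmniGen` package and the `OmniGenPipeline` interface described in the project repository; the exact signature may differ between versions, so treat this as an outline and check the README:

```python
# Assumes: pip install from the OmniGen project repository.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Plain text-to-image; OmniGen also accepts interleaved image inputs
# for editing-style tasks (see the project README).
images = pipe(
    prompt="A watercolor painting of a lighthouse at dusk.",
    height=1024,
    width=1024,
    guidance_scale=2.5,
    seed=0,
)
images[0].save("omnigen_sample.png")
```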
## Lumina Mgpt 7B 768
Lumina-mGPT is a family of multimodal autoregressive models that excels at generating flexible, realistic images from text descriptions and can perform a variety of vision and language tasks.
Tags: Text-to-Image, Transformers · Author: Alpha-VLLM · Downloads: 1,944 · Likes: 33
## Mambavision B 1K
MambaVision-B-1K is a hybrid Mamba-Transformer vision backbone developed by NVIDIA, pretrained on ImageNet-1K for image classification and visual feature extraction.
License: Apache-2.0 · Tags: Image Classification, Transformers · Author: nvidia · Downloads: 1,082 · Likes: 11
## Llama3 Med42 8B
Med42-v2 is a clinically aligned large language model suite developed by M42 on the LLaMA-3 architecture, available in 8B and 70B parameter versions and designed for high-quality medical Q&A.
Tags: Large Language Model, Transformers, English · Author: m42-health · Downloads: 6,755 · Likes: 66
## Llama3 Med42 70B
Med42-v2 is an open-access clinical large language model suite developed by M42, built on LLaMA-3 in 8-billion and 70-billion parameter versions, capable of high-quality medical question answering.
Tags: Large Language Model, Transformers, English · Author: m42-health · Downloads: 11.10k · Likes: 46
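Both Med42-v2 sizes are standard Llama-3-style causal LMs, so they load with the regular transformers APIs. A sketch against the 8B variant, using the tokenizer's chat template; the repo id `m42-health/Llama3-Med42-8B` is assumed from the listing, so verify it on the hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m42-health/Llama3-Med42-8B"  # assumed repo id; check the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful clinical assistant."},
    {"role": "user", "content": "What are common symptoms of iron deficiency?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```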
## Resnet50 Facial Emotion Recognition
A ResNet-50-based image classification model for recognizing facial emotions.
License: Apache-2.0 · Tags: Image Classification, Transformers · Author: KhaldiAbderrhmane · Downloads: 50 · Likes: 3
## Libra 11b Base
Libra is a decoupled vision system built on large language models, providing fundamental multimodal understanding capabilities.
License: Apache-2.0 · Tags: Image-to-Text, Transformers · Author: YifanXu · Downloads: 18 · Likes: 0
## Yuna Ai V3
Yuna AI is a virtual companion model designed for emotional connection, offering deeper interactive experiences than a traditional assistant.
Tags: Large Language Model, Multilingual · Author: yukiarimo · Downloads: 139 · Likes: 10
## Vitamin XL 256px
ViTamin-XL-256px is a vision-language model based on the ViTamin architecture, designed for efficient visual feature extraction and multimodal tasks, with support for high-resolution image processing.
License: MIT · Tags: Image Feature Extraction, Transformers · Author: jienengchen · Downloads: 655 · Likes: 1
## Moai 7B
MoAI is a large-scale hybrid language-and-vision model that processes both image and text inputs to generate text outputs.
License: MIT · Tags: Image-to-Text, Transformers · Author: BK-Lee · Downloads: 183 · Likes: 45
## Llava Maid 7B DPO GGUF
LLaVA is a large language-and-vision assistant model that handles multimodal tasks involving images and text; this repository provides a DPO-tuned variant in GGUF format.
Tags: Image-to-Text · Author: megaaziib · Downloads: 99 · Likes: 4
## Erasedraw
EraseDraw is a diffusion-based image editing tool that can insert or modify objects in images from text prompts.
License: MIT · Tags: Image Generation · Author: alpercanberk · Downloads: 30 · Likes: 3
## Supermario Slerp V2
supermario-slerp-v2 is a text generation model created by merging two 7B-parameter models with the SLERP method, showing strong results across multiple benchmarks.
License: Apache-2.0 · Tags: Large Language Model, Transformers, English · Author: jan-hq · Downloads: 15 · Likes: 2
## Vit Gpt2 Image Captioning
An image captioning model built on the Vision Encoder-Decoder architecture (ViT encoder, GPT-2 decoder) that generates natural language descriptions for input images.
License: Apache-2.0 · Tags: Image-to-Text, Transformers · Author: baseplate · Downloads: 55 · Likes: 2
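Vision Encoder-Decoder captioners like this are supported directly by the transformers image-to-text pipeline. A sketch, assuming the repo id `baseplate/vit-gpt2-image-captioning` (inferred from the listing; confirm on the model page):

```python
from transformers import pipeline

# Repo id inferred from author and title; verify before use.
captioner = pipeline("image-to-text", model="baseplate/vit-gpt2-image-captioning")

result = captioner("example.jpg")  # accepts a path, URL, or PIL image
print(result[0]["generated_text"])
```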
## Cartoonizer
An instruction-tuned variant of Stable Diffusion v1.5, designed specifically for image cartoonization.
License: MIT · Tags: Image Generation, Other · Author: instruction-tuning-sd · Downloads: 232 · Likes: 76
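The instruction-tuning-sd models follow the InstructPix2Pix recipe, so the natural loading path is Diffusers' `StableDiffusionInstructPix2PixPipeline`. The repo id below is assumed from the listing; check the model card before use:

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Repo id assumed from the listing; confirm on the model card.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "instruction-tuning-sd/cartoonizer", torch_dtype=torch.float16
).to("cuda")

image = Image.open("portrait.jpg").convert("RGB")
edited = pipe("Cartoonize the following image", image=image).images[0]
edited.save("portrait_cartoon.png")
```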
## Bart Ranker
A model that predicts the relevance of query-document pairs, suitable for information retrieval tasks.
License: MIT · Tags: Text Embedding, Transformers · Author: bsl · Downloads: 31 · Likes: 3
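Relevance models of this kind are typically cross-encoders: the query and document are scored jointly as one sequence pair. A generic sketch with the transformers sequence-classification API; the repo id `bsl/bart-ranker` and its label layout are assumptions, so check the model card:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "bsl/bart-ranker"  # assumed repo id; verify on the hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

query = "what causes tides"
doc = "Tides are caused by the gravitational pull of the moon and sun."

# Cross-encoders score the (query, document) pair in a single forward pass.
inputs = tokenizer(query, doc, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits
print(score)
```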
## Chinese Clip Vit Base Patch16
The base version of Chinese CLIP, using ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder, trained on roughly 200 million Chinese image-text pairs.
Tags: Text-to-Image, Transformers · Author: OFA-Sys · Downloads: 49.02k · Likes: 104
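Chinese CLIP has first-class support in transformers via `ChineseCLIPModel` and `ChineseCLIPProcessor`, which makes zero-shot image-text matching straightforward:

```python
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
texts = ["一只猫", "一只狗", "一辆汽车"]  # candidate captions in Chinese

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over image-text similarity gives zero-shot label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```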
## Rare Puppers3
An image classification model generated with HuggingPics, trained to recognize a specific set of image categories.
Tags: Image Classification, Transformers · Author: Samlit · Downloads: 28 · Likes: 0
## Llama Horse Zebra
An image classification model generated with HuggingPics that accurately distinguishes animals such as horses, llamas, and zebras.
Tags: Image Classification, Transformers · Author: osanseviero · Downloads: 38 · Likes: 0
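HuggingPics classifiers are ordinary ViT image classifiers, so they work with the stock image-classification pipeline. A sketch assuming the repo id `osanseviero/llama-horse-zebra` (inferred from the listing; verify on the hub):

```python
from transformers import pipeline

# Repo id inferred from author and title; confirm before use.
classifier = pipeline("image-classification", model="osanseviero/llama-horse-zebra")

for pred in classifier("animal.jpg"):  # path, URL, or PIL image
    print(f"{pred['label']}: {pred['score']:.3f}")
```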